Introduction
Description of the motivation for undertaking the task of creating selective 
transparent headphones. 


The advancement of signal processing technologies has consistently 
improved the enjoyment of listening to music. For example, noise 
cancelling techniques have been widely applied to headphones that, in 
addition to physical noise isolation, provide further reduction in noise. 


We believe, however, that it is sometimes important for music listeners
wearing a good pair of noise-isolating headphones to be aware of certain
surrounding signals. For example, people who cross the street while
listening to music need to hear car horns that warn them of possible danger.
We also think that human speech is important, and that people may want to
hear others talking to them while they are listening to music.


Based on the assumptions above, we propose to build a selective transparent
headphone that passes certain surrounding sound signals through and
suppresses all other signals. In this project, we explore the possibility of
passing only speech signals through while silencing other surrounding sounds.


The remaining sections are organized as follows. The second section
provides an overview of the project, defining the problem and explaining
the high-level system implementation. The third through fifth sections
explain in detail the blind source separation algorithm and the binary
artificial neural network. The final sections discuss the experimental
results and suggest next steps.


Project Overview 
Overview of the steps followed to create the system. 


Problem Definition 


There are a number of sound signals in the environment that are important
to a listener. In this project, we identify human speech as the important
signal that we would like to focus on. We would like to separate the
environment sound signals into a signal containing only human speech and
a signal containing all other sounds. With these simplifications, the
problem is reduced to finding the speech content in the surrounding
environment and forwarding the speech content to the listener while
suppressing other signals. If there are no speech signals, or speech signals
are weak compared to other signals, all sounds from the environment will
be attenuated.


System Implementation 


We approach the problem stated above by first separating the observed
signals into human speech content and non-speech content with a blind
source separation algorithm. After that, a classification algorithm
determines which signal contains human speech, if any, and outputs that
signal. A detailed overview is shown in the system block diagram in
Figure 1.
The audio input, a mixture of human speech and some other noise such as
instrumental music, is passed to the blind source separation block. Inside
this block, a short-time Fourier transform is first performed to convert the
time-domain input signal into the frequency domain, since all the other
operations on the signals take place in the frequency domain. After that, a
preprocessing filter is applied, which cleans the input signal by reducing
reflections and ambient noise. Then, for each frequency, the independent
component analysis algorithm separates the signal into two parts such that
independence between the two signals is maximized. Scaling and
permutation filters minimize distortions caused by using the independent
component analysis method. The blind source separation process outputs
two signals, of which at most one contains speech.


We then implement a binary artificial neural network (ANN) for
classification. Two identical ANNs each take one of the two signals
separated from the source as input and output a weight for that signal. We
train the neural network so that a signal containing more human speech
produces a higher weight. Based on the weights of the separated signals,
the output selection multiplexer chooses the proper signal to output.
That is, if one signal, referred to as A, has a higher weight than the other
signal, referred to as B, and its weight is above a certain threshold (0.5), the
multiplexer outputs signal A. If the weights of both signal A and signal B
are below that threshold, then the multiplexer outputs nothing. The
threshold handles the situation where neither signal contains human speech.
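
The selection rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described, not the MATLAB implementation used in the project; the function name `select_output`, the handling of ties (signal A wins), and the treatment of a weight exactly equal to the threshold (no output) are assumptions.

```python
def select_output(weight_a, weight_b, threshold=0.5):
    """Return 'A', 'B', or None following the multiplexer rule in the text."""
    # Pick the signal with the larger weight (assumption: ties go to A).
    if weight_a >= weight_b:
        best_label, best_weight = "A", weight_a
    else:
        best_label, best_weight = "B", weight_b
    # Output nothing unless the stronger candidate is above the threshold.
    if best_weight <= threshold:
        return None
    return best_label
```

For example, `select_output(0.9, 0.2)` selects signal A, while `select_output(0.4, 0.1)` returns `None` because neither weight exceeds 0.5.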


Background on Independent Component Analysis 

Describes the motivation and background behind using independent 
component analysis as the basis of the blind source separation algorithm 
used in this endeavor. 


The blind source separation algorithm we employed is based on
independent component analysis (ICA), which is explained in more detail
below. The goal is to recover independent sources given only sensor
observations that are linear mixtures of independent source signals. We
assume that the source signals are statistically independent and
non-Gaussian.


ICA works by solving the following model.

$$x = As$$


Here, vector x contains the n observed signals, vector s contains the
independent sources that comprise x, and matrix A denotes the mixing
matrix. Both the mixing matrix A and the source vector s are assumed to be
unknown. An ICA algorithm thus produces the best estimate of both the
mixing matrix and the source vector.


The first step in ICA is usually whitening (sphering) the data, which
removes any correlations in the data, i.e. the signals are forced to be
uncorrelated. In mathematical terms, we seek a linear transformation V
such that when $y = Vx$ we have $E[y y^T] = I$.
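
As a minimal sketch of this step for two real-valued channels, the following Python code builds V from the eigen-decomposition of the sample covariance matrix. The function names, the closed-form 2x2 eigen-decomposition, and the example mixing coefficients are assumptions chosen for illustration; they are not taken from the project's MATLAB code.

```python
import math
import random

def covariance(a, b):
    """Sample covariance of two equal-length, zero-mean sequences."""
    return sum(p * q for p, q in zip(a, b)) / len(a)

def whiten_2d(x1, x2):
    """Whiten two channels so that the result has identity covariance."""
    # Remove the mean of each channel first.
    m1, m2 = sum(x1) / len(x1), sum(x2) / len(x2)
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    # Covariance matrix [[a, b], [b, d]] of the centered data.
    a, b, d = covariance(x1, x1), covariance(x1, x2), covariance(x2, x2)
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix:
    # eigenvectors (c, s) and (-s, c), eigenvalues lam1 and lam2.
    theta = 0.5 * math.atan2(2.0 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam1 = a * c * c + 2.0 * b * c * s + d * s * s
    lam2 = a * s * s - 2.0 * b * c * s + d * c * c
    # V = diag(lam)^(-1/2) E^T, applied sample by sample.
    y1 = [(c * p + s * q) / math.sqrt(lam1) for p, q in zip(x1, x2)]
    y2 = [(-s * p + c * q) / math.sqrt(lam2) for p, q in zip(x1, x2)]
    return y1, y2

# Illustrative data: two uniform sources with hypothetical mixing weights.
rng = random.Random(0)
s1 = [rng.uniform(-1, 1) for _ in range(500)]
s2 = [rng.uniform(-1, 1) for _ in range(500)]
x1 = [p + 0.5 * q for p, q in zip(s1, s2)]
x2 = [0.3 * p + q for p, q in zip(s1, s2)]
y1, y2 = whiten_2d(x1, x2)
```

After this transformation the two output channels have unit variance and zero cross-covariance, which is the condition $E[y y^T] = I$ stated above.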


After sphering, the separated signals can be found by an orthogonal 
transformation of the whitened signals y (this is simply a rotation of the 
joint density). The appropriate rotation is sought by maximizing the non- 
normality of the marginal densities. This is because of the fact that a linear 
mixture of independent random variables is necessarily more Gaussian than 
the original variables. 
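
The rotation search can be illustrated with a small Python sketch. It assumes two channels, uses the absolute excess kurtosis as the measure of non-normality (one common choice; the text does not say which measure the project used), and finds the angle by a simple grid search. The example data are two uniform sources mixed by a pure rotation of 0.6 rad, so the observations are already approximately white.

```python
import math
import random

def excess_kurtosis(v):
    """Excess kurtosis of a sequence (zero for a Gaussian)."""
    n = len(v)
    mean = sum(v) / n
    m2 = sum((x - mean) ** 2 for x in v) / n
    m4 = sum((x - mean) ** 4 for x in v) / n
    return m4 / (m2 * m2) - 3.0

def best_rotation(y1, y2, steps=90):
    """Grid-search the angle in [0, pi/2) that maximizes non-Gaussianity."""
    best_angle, best_score = 0.0, -1.0
    for k in range(steps):
        angle = k * (math.pi / 2) / steps
        c, s = math.cos(angle), math.sin(angle)
        # Rotate the whitened pair by the candidate angle.
        u1 = [c * p + s * q for p, q in zip(y1, y2)]
        u2 = [-s * p + c * q for p, q in zip(y1, y2)]
        score = abs(excess_kurtosis(u1)) + abs(excess_kurtosis(u2))
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Illustrative data: independent uniform sources rotated by 0.6 rad.
rng = random.Random(0)
s1 = [rng.uniform(-1, 1) for _ in range(2000)]
s2 = [rng.uniform(-1, 1) for _ in range(2000)]
c, s = math.cos(0.6), math.sin(0.6)
y1 = [c * p - s * q for p, q in zip(s1, s2)]
y2 = [s * p + c * q for p, q in zip(s1, s2)]
angle = best_rotation(y1, y2)
```

The recovered angle should lie close to the 0.6 rad used for mixing, because any other rotation leaves the outputs as mixtures of both sources and therefore closer to Gaussian.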


Although in theory an ICA algorithm can perfectly separate source signals
from observed signals, provided that the sources are statistically
independent and non-Gaussian, most of the algorithms do not work well in
real-world scenarios. For example, ICA does not typically deal with
reflections; that is, if a mixed signal is observed in a small room with
reverberation, ICA algorithms such as FastICA can hardly identify the
individual components of the observed signal. Therefore, we need a more
robust blind source separation algorithm, which is discussed in the
following section.


Blind Source Separation 

Describes the method to separate an arbitrary signal into its independent 
components for the application of creating a selective transparent 
headphone. 


Background 


When separating mixtures of instantaneously mixed signals, independent
component analysis works very well, but this ideal situation is rarely found
in the real world. In practical applications, the environment distorts audio
signals by adding echoes, reflections, and ambient noise. Additionally,
independent component analysis, in its purest form, assumes that source
signals do not have any propagation delay, which is an assumption that
cannot be applied in this case. Recording sources from two microphones
placed in different locations will inevitably introduce propagation delays, so
the blind source separation method used must also consider this issue.


To solve the problems detailed above, the blind source separation problem
will be redefined in the time-frequency domain. By taking the short-time
Fourier transform (STFT) of the audio inputs, we can represent the inputs as
the following.


$$x(\omega, t) = [X_1(\omega, t), X_2(\omega, t), \ldots, X_M(\omega, t)]^T$$


$X_i(\omega, t)$ refers to the input observed at the ith microphone at
frequency $\omega$ and time t. The T symbol refers to the transpose, so in
this case, the input sources are represented along the rows. For our
application, we will assume that the number of observed inputs and the
number of separated sources are both equal to the constant M. We can now
formulate our blind source separation problem as the following.


$$x(\omega, t) = A(\omega) s(\omega, t) + n(\omega, t)$$


In this case, $s(\omega, t)$ refers to the vector of source signals observed
at frequency $\omega$ and at time t. The second term, $n(\omega, t)$,
represents any type of noise or distortion that may be present in the
observed signals (noise, reflections, etc.). The blind source separation
method attempts to solve for the mixing matrix, $A(\omega)$, whose entries
can be represented as the following.


$$A_{i,j}(\omega) = H_{i,j}(\omega) e^{j \omega \tau_{i,j}}$$


Here, $H_{i,j}(\omega)$ refers to the transfer function from the jth source
to the ith microphone. Additionally, $\tau_{i,j}$ refers to the propagation
delay from the jth source to the ith microphone.


The system on which our problem is defined is now redefined from an
instantaneous mixture of signals to a convolutional mixture in the time
domain of the following form.


$$x(t) = a(t) * s(t)$$


This formulation allows for a solution that accounts for the distortion,
$n(\omega, t)$, added by the environment and for the propagation delay
that is inherently present in our application.


Since the new formulation of the blind source separation problem is a 
convolutional system in the time-domain, we will solve the separation 
problem in the time-frequency domain formulation shown in Figure 2. 
Given our approach outlined above, we can solve for the complete 
frequency-domain separation filter shown below. 


$$G(\omega) = P(\omega) B_m^{+}(\omega) B(\omega)$$


$B(\omega)$ is the separation filter of the system. Applying this filter will
separate the spectrum of the observed signal into the spectra of the source
signals observed at frequency $\omega$. As stated above, by reformulating
the problem in the frequency domain, the convolutional mixture is
transformed into an instantaneous mixture. As such, independent component
analysis can be applied to separate the source signals at each frequency.
However, in doing so, we introduce other problems, inherent to the
independent component analysis method, that the other filters attempt to
resolve.
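
To make the per-frequency view concrete, the sketch below computes an STFT for each microphone channel and regroups the result so that each frequency bin holds a sequence of M-element observation vectors, one per frame. It is a plain-Python illustration with a naive DFT; the function names and the Hann window are assumptions, and the frame length and hop are left as parameters rather than tied to the values used in the project.

```python
import cmath
import math

def stft(signal, frame_len, hop):
    """Return one list of complex DFT bins per frame (naive O(N^2) DFT)."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        # Hann window; the window choice is an assumption of this sketch.
        frame = [
            signal[start + n]
            * 0.5 * (1.0 - math.cos(2.0 * math.pi * n / frame_len))
            for n in range(frame_len)
        ]
        frames.append([
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len)
        ])
    return frames

def observations_per_bin(channels, frame_len, hop):
    """Arrange M channels as bins[k][t] = [X_1(k, t), ..., X_M(k, t)]."""
    spectra = [stft(channel, frame_len, hop) for channel in channels]
    n_frames = len(spectra[0])
    return [
        [[spectrum[t][k] for spectrum in spectra] for t in range(n_frames)]
        for k in range(frame_len)
    ]
```

With this layout, `bins[k]` is the instantaneous mixture seen at frequency bin k, which is the data an ICA routine would be run on separately for each k.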


The following sections will explain the three filters in detail. 


• $B(\omega)$, the separation filter.

• $B_m^{+}(\omega)$, the normalization filter used to solve the scaling
problem that arises after applying independent component analysis.

• $P(\omega)$, the permutation filter used to solve the frequency distortion
problem that arises after applying independent component analysis.


Separation Filter 


The separation filter shown in the previous section can be broken down
into the following components.


$$B(\omega) = U(\omega) W(\omega)$$


$W(\omega)$ is the preprocessing filter used to reduce reflections and
ambient noise by using the subspace method. $U(\omega)$ is the filter
obtained by applying independent component analysis after preprocessing.


I. Subspace Method 


The spatial correlation, or autocorrelation, matrix at frequency $\omega$ is
defined below.


$$R(\omega) = E[x(\omega, t) x^H(\omega, t)]$$


Given that there are M inputs and M sources, the resulting matrix will be
M×M. This matrix can be rewritten in the following way.


$$R(\omega) = A Q A^H + K$$     (the spatial correlation matrix)

$$K = E[n(t) n^H(t)]$$          (the noise correlation matrix)

$$Q = E[s(t) s^H(t)]$$          (the sources' cross-spectrum matrix)


In other words, K is the correlation matrix of n(t) and Q is the cross- 
spectrum matrix of the sources. However, since the source signals, s(t), are 
unknown at this point, this equation cannot be directly solved. Instead, we 
can represent the generalized eigenvalue decomposition of the spatial 
correlation matrix in the following manner. 


$$R = K E \Lambda E^{-1}$$


E is the eigenvector matrix, $E = [e_1, e_2, \ldots, e_M]$, and
$\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_M)$ contains
the eigenvalues of R. Since K, the noise correlation matrix, cannot be
directly observed apart from the source signals, we will assume that
$K = I$, which will evenly distribute the reflections and ambient noise
induced by the environment among the estimated sources. As such, the
eigenvalue decomposition of R can be rewritten as the following.


$$R = E \Lambda E^{-1}$$


The final preprocessing filter uses the eigenvector and eigenvalue matrices
resulting from the standard eigenvalue decomposition of the spatial
correlation matrix and is shown below.


$$W = \Lambda^{-1/2} E^H$$


II. Independent Component Analysis 


By applying the preprocessing filter obtained from the subspace method,
our observed signals are now of the following form.


$$y(\omega, t) = W(\omega) x(\omega, t)$$


To estimate the source signals from the preprocessed input signals, we can
use independent component analysis to solve the linear system displayed
below for the filter $U(\omega)$.


$$y(\omega, t) = U^{-1}(\omega) \hat{s}(\omega, t)$$


Now that the separation filter at each frequency has been found, the 
problems that arise from the separation process must be addressed. 


Scaling 


The scaling problem that results from applying independent component
analysis can be resolved using the filter shown below.


$$B_m^{+}(\omega) = \mathrm{diag}[B^{+}_{m,1}(\omega), B^{+}_{m,2}(\omega), \ldots, B^{+}_{m,M}(\omega)]$$


$B^{+}_{m,n}(\omega)$ refers to the (m,n)th entry in $B^{+}(\omega)$, which
is the pseudo-inverse of $B(\omega)$ (the regular inverse also works here
since we assume that the number of separated sources and the number of
inputs are both M, so that the resulting separation matrix is square).
Additionally, m refers to an arbitrary row, or microphone, in
$B^{+}(\omega)$. By applying this matrix, each signal, i, is amplified by
the component of signal i observed at microphone or input m. Essentially,
this filter amplifies the frequency components of the separated source
signals so that the waveforms of the resulting sources will be
distinguishable and audible.
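
For the two-input case, the scaling filter at one frequency can be sketched as follows. The helper names and the example matrix are hypothetical; the code simply inverts a 2x2 complex separation matrix and places row m of the inverse on a diagonal, as described above.

```python
def inverse_2x2(b):
    """Inverse of a 2x2 (possibly complex) matrix given as nested lists."""
    (p, q), (r, s) = b
    det = p * s - q * r
    return [[s / det, -q / det], [-r / det, p / det]]

def scaling_filter(b, m=0):
    """Diagonal matrix built from row m of the inverse of B."""
    b_inv = inverse_2x2(b)
    return [[b_inv[m][0], 0], [0, b_inv[m][1]]]

# Hypothetical separation matrix at a single frequency (determinant 1).
B = [[1 + 1j, 2], [0.5, 1 - 1j]]
S = scaling_filter(B, m=0)
```

Here the inverse of `B` is `[[1-1j, -2], [-0.5, 1+1j]]`, so choosing microphone `m=0` yields the diagonal entries `1-1j` and `-2`.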


Permutation 


A second critical problem that arises after applying independent component 
analysis is that the order in which the source signals are returned is 
unknown. Since independent component analysis is applied at each 
frequency, we must find the permutation of components that has the highest 
chance of being the correct permutation to reconstruct the source signals 
and minimize the amount of frequency distortion caused by independent 
component analysis. 


We define the permutation matrix, P, in terms of the following matrix.


$$Z(\omega) = [z_1(\omega), z_2(\omega), \ldots, z_M(\omega)]$$


The matrix P exchanges the column vectors of $Z(\omega)$ to get different
permutations. We define the cosine of the angle $\theta_n$ between the two
vectors $z_n(\omega)$ and $z_n(\omega_0)$, where $\omega_0$ is the
reference frequency, as follows.


$$\cos(\theta_n) = \frac{|z_n^H(\omega) z_n(\omega_0)|}{\|z_n(\omega)\| \, \|z_n(\omega_0)\|}$$


The permutation is then determined by the following.


$$P = \arg\max_P F(P)$$


The cost function, F(P), used above is built from the cosines
$\cos(\theta_n)$, so it compares the inverse filter vectors at each
frequency with those at a reference frequency, $\omega_0$. However, if we
use the same reference frequency to find the permutation at every other
frequency, a problem arises when the filter at the reference frequency has
the incorrect permutation. In order to minimize this error, a new reference
frequency is chosen after the permutations of filters at K frequencies are
found. As such, the range of the reference frequency, $\omega_0$, is as
follows.


$$\omega_0 = \omega - k \cdot \Delta\omega, \quad \forall k = 1, \ldots, K$$
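
A small Python sketch of the permutation search follows. It assumes that the cost F(P) is the sum of the cosines between matched columns, which is one natural reading of the description above rather than a formula confirmed by the text, and it tries every ordering exhaustively, which is only practical for a small number of sources. The function names and example vectors are hypothetical.

```python
import math
from itertools import permutations

def cosine(u, v):
    """|u^H v| / (||u|| ||v||) for complex vectors given as lists."""
    inner = sum(a.conjugate() * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(abs(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(abs(b) ** 2 for b in v))
    return abs(inner) / (norm_u * norm_v)

def best_permutation(columns, reference):
    """Ordering of `columns` that best matches `reference`, column by column."""
    def score(order):
        # Assumed cost: sum of cosines between matched column pairs.
        return sum(cosine(columns[i], ref) for i, ref in zip(order, reference))
    return max(permutations(range(len(columns))), key=score)

# Reference columns at omega_0, and columns at omega that arrive swapped
# and rescaled (hypothetical values).
reference = [[1, 0.1j], [0.1, 1j]]
columns = [[0.2, 2j], [3, 0.2j]]
order = best_permutation(columns, reference)
```

In this example the second column at the current frequency lines up with the first reference column, so the returned ordering is `(1, 0)`.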


Background on Artificial Neural Network 
This module gives background on artificial neural networks.


Motivation: 


As the Blind Source Separation algorithm randomly assigns the two 
independent components into two output vectors, we need the artificial 
neural network to determine which of the two signals, if any, contains 
human speech. 


Background on Our Chosen Neural Network Model: 
Our Neural Networks Model: 


A simplified schematic of our neural network is shown below.


[Figure: Neural Network Model, showing Layer L1, Layer L2, and Layer L3]


Our artificial neural network consists of one input layer (Layer L1), one
hidden layer (Layer L2), and one output layer (Layer L3). The circles
labeled "+1" are called the bias units and correspond to the intercept term.


Our neural network has parameters
$(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$, where we write
$W^{(l)}_{ij}$ to denote the weight associated with the connection between
unit j in layer l and unit i in layer l+1. Also, $b^{(l)}_i$ is the bias
associated with unit i in layer l+1. We also use $s_l$ to denote the number
of nodes in layer l (not counting the bias unit).


We will write $a^{(l)}_i$ to denote the activation of unit i in layer l. For
l=1, we also use $a^{(1)}_i = x_i$ to denote the i-th input. Given a fixed
setting of the parameters W, b, our neural network defines a hypothesis
$h_{W,b}(x)$ that outputs a real number. Specifically, the computation that
this neural network represents is given by:

$$a^{(2)}_1 = f(W^{(1)}_{11} x_1 + W^{(1)}_{12} x_2 + W^{(1)}_{13} x_3 + b^{(1)}_1)$$
$$a^{(2)}_2 = f(W^{(1)}_{21} x_1 + W^{(1)}_{22} x_2 + W^{(1)}_{23} x_3 + b^{(1)}_2)$$
$$a^{(2)}_3 = f(W^{(1)}_{31} x_1 + W^{(1)}_{32} x_2 + W^{(1)}_{33} x_3 + b^{(1)}_3)$$
$$h_{W,b}(x) = a^{(3)}_1 = f(W^{(2)}_{11} a^{(2)}_1 + W^{(2)}_{12} a^{(2)}_2 + W^{(2)}_{13} a^{(2)}_3 + b^{(2)}_1)$$

Let $z^{(l)}_i$ denote the total weighted sum of inputs to unit i in layer l,
including the bias unit. For instance,

$$z^{(2)}_i = \sum_{j=1}^{n} W^{(1)}_{ij} x_j + b^{(1)}_i$$

so that

$$a^{(l)}_i = f(z^{(l)}_i)$$


If we extend the activation function f(.) to apply to vectors in an
element-wise fashion, we can write the above equations more compactly as:


$$z^{(2)} = W^{(1)} x + b^{(1)}$$
$$a^{(2)} = f(z^{(2)})$$
$$z^{(3)} = W^{(2)} a^{(2)} + b^{(2)}$$
$$h_{W,b}(x) = a^{(3)} = f(z^{(3)})$$


The above equations represent the forward propagation step. More
generally, as we use $a^{(1)} = x$ to denote the values from the input layer,
then given layer l's activations $a^{(l)}$, we can compute layer l+1's
activations $a^{(l+1)}$ as:


$$z^{(l+1)} = W^{(l)} a^{(l)} + b^{(l)}$$
$$a^{(l+1)} = f(z^{(l+1)})$$
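
The forward propagation step can be sketched in plain Python as shown below. The sigmoid is used for f, and the function names and the list-of-rows weight layout are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def layer_forward(weights, biases, activations):
    """Compute f(W a + b) for one layer; `weights` is a list of rows."""
    return [
        sigmoid(sum(w * a for w, a in zip(row, activations)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(params, x):
    """Return the activations of every layer, starting with the input.

    `params` is a list of (W, b) pairs, one per layer transition.
    """
    activations = [x]
    for weights, biases in params:
        activations.append(layer_forward(weights, biases, activations[-1]))
    return activations
```

With all weights and biases set to zero, every unit receives z = 0 and outputs exactly 0.5, which is a convenient sanity check for an implementation.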


Backpropagation Algorithm: 
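Suppose we have a fixed training set

$$\{(x^{(1)}, y^{(1)}), \ldots, (x^{(m)}, y^{(m)})\}$$

of m training examples. The overall cost function is:

$$J(W, b) = \left[ \frac{1}{m} \sum_{i=1}^{m} J(W, b; x^{(i)}, y^{(i)}) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2$$

$$= \left[ \frac{1}{m} \sum_{i=1}^{m} \left( \frac{1}{2} \left\| h_{W,b}(x^{(i)}) - y^{(i)} \right\|^2 \right) \right] + \frac{\lambda}{2} \sum_{l=1}^{n_l - 1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} \left( W^{(l)}_{ji} \right)^2$$


The first term in the cost function is an average sum-of-squares error term.
The second term is the regularization term, which is used to prevent
overfitting. The weight decay parameter $\lambda$ controls the relative
importance of the cost term and the regularization term.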
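
A direct transcription of this cost into Python might look like the following. The function name and argument layout are assumptions; the network outputs are passed in as precomputed predictions so that the example stays short.

```python
def cost(predictions, targets, weight_matrices, lam):
    """Average squared-error cost plus the weight decay term.

    predictions, targets: one output vector (list) per training example.
    weight_matrices: list of weight matrices, each a list of rows.
    lam: weight decay parameter (lambda in the text).
    """
    m = len(predictions)
    # Average of 0.5 * ||h(x) - y||^2 over the training examples.
    data_term = sum(
        0.5 * sum((h - y) ** 2 for h, y in zip(pred, target))
        for pred, target in zip(predictions, targets)
    ) / m
    # (lambda / 2) times the sum of all squared weights; biases are excluded.
    decay_term = 0.5 * lam * sum(
        w * w for matrix in weight_matrices for row in matrix for w in row
    )
    return data_term + decay_term
```

For two examples with predictions 1.0 and 0.0, targets 0.0 and 0.0, a single weight matrix `[[1.0, 2.0]]`, and a decay parameter of 0.1, the data term is 0.25 and the decay term is 0.25, giving a total cost of 0.5.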


Our goal is to minimize the cost function J(W,b) as a function of W and b.
To begin training our neural network, we will initialize each weight
$W^{(l)}_{ij}$ and each $b^{(l)}_i$ to a small random value near zero. It is
important to initialize the parameters randomly for the purpose of
symmetry breaking.


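In order to perform gradient descent to minimize the cost function, we need
to compute the derivatives of the cost function with respect to
$W^{(l)}_{ij}$ and $b^{(l)}_i$ respectively:


$$\frac{\partial}{\partial W^{(l)}_{ij}} J(W, b) = \left[ \frac{1}{m} \sum_{k=1}^{m} \frac{\partial}{\partial W^{(l)}_{ij}} J(W, b; x^{(k)}, y^{(k)}) \right] + \lambda W^{(l)}_{ij}$$

$$\frac{\partial}{\partial b^{(l)}_{i}} J(W, b) = \frac{1}{m} \sum_{k=1}^{m} \frac{\partial}{\partial b^{(l)}_{i}} J(W, b; x^{(k)}, y^{(k)})$$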


The intuition behind the backpropagation algorithm is as follows. Given a
training example (x,y), we will first run a "forward propagation" to
compute all the activations, including the output value of the hypothesis
$h_{W,b}(x)$. Then for each node i in layer l, we compute an error term
$\delta^{(l)}_i$ that measures how much that node was responsible for any
errors in the output. For an output node, we can directly measure the
difference between the network's hypothesis and the true value, and use
that to define the error term for the output layer.


Here are the steps of the backpropagation algorithm as implemented in
MATLAB:


1. Perform a feedforward propagation, computing the activations for
hidden layer L2 and output layer L3.
2. For the output layer (layer $n_l$), set


$$\delta^{(n_l)} = -(y - a^{(n_l)}) \bullet f'(z^{(n_l)})$$


3. For the hidden layer, set


$$\delta^{(l)} = ((W^{(l)})^T \delta^{(l+1)}) \bullet f'(z^{(l)})$$


4. Compute the desired partial derivatives of the cost function with respect
to W and b


$$\nabla_{W^{(l)}} J(W, b; x, y) = \delta^{(l+1)} (a^{(l)})^T$$
$$\nabla_{b^{(l)}} J(W, b; x, y) = \delta^{(l+1)}$$


Here $\bullet$ denotes the element-wise product.


Implementation note: in steps 2 and 3 above, we need to compute
$f'(z^{(l)}_i)$ for each value of i. Since our f(z) is the sigmoid function
and $a^{(l)}_i$ has already been computed during the feedforward
propagation process, we can use the expression for f'(z) to compute this as


$$f'(z^{(l)}_i) = a^{(l)}_i (1 - a^{(l)}_i)$$
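
The backpropagation steps for a single training example can be sketched in Python as follows. This is an illustrative re-expression, not the project's MATLAB code: the function names and list-based layout are assumptions, the cost is the single-example squared error without weight decay, and a `loss` helper is included so the gradients can be compared against finite differences.

```python
import math

def sigmoid(z):
    """Sigmoid activation f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(W1, b1, W2, b2, x):
    """Activations of the hidden layer (a2) and the output layer (a3)."""
    a2 = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
          for row, b in zip(W1, b1)]
    a3 = [sigmoid(sum(w * v for w, v in zip(row, a2)) + b)
          for row, b in zip(W2, b2)]
    return a2, a3

def loss(W1, b1, W2, b2, x, y):
    """Single-example cost 0.5 * ||h(x) - y||^2 (no weight decay)."""
    _, a3 = forward(W1, b1, W2, b2, x)
    return 0.5 * sum((h - t) ** 2 for h, t in zip(a3, y))

def backprop(W1, b1, W2, b2, x, y):
    """Return (grad_W1, grad_b1, grad_W2, grad_b2) for one example."""
    # Step 1: feedforward pass.
    a2, a3 = forward(W1, b1, W2, b2, x)
    # Step 2: output-layer error, using f'(z) = a * (1 - a).
    d3 = [-(t - a) * a * (1.0 - a) for t, a in zip(y, a3)]
    # Step 3: hidden-layer error, (W2^T d3) times f'(z2) element-wise.
    d2 = [
        sum(W2[k][j] * d3[k] for k in range(len(d3))) * a2[j] * (1.0 - a2[j])
        for j in range(len(a2))
    ]
    # Step 4: gradients are delta^(l+1) (a^(l))^T and delta^(l+1).
    grad_W1 = [[d * v for v in x] for d in d2]
    grad_W2 = [[d * a for a in a2] for d in d3]
    return grad_W1, d2, grad_W2, d3
```

A standard way to validate such an implementation is gradient checking: perturb one weight by a small amount in each direction, evaluate `loss` twice, and compare the finite-difference slope with the corresponding entry returned by `backprop`.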


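Below is the pseudocode of the gradient descent algorithm.
Note: $\Delta W^{(l)}$ is a matrix of the same dimension as $W^{(l)}$, and
$\Delta b^{(l)}$ is a vector of the same dimension as $b^{(l)}$. One
iteration of gradient descent is as follows:


1. Set $\Delta W^{(l)} := 0$, $\Delta b^{(l)} := 0$ for all l.
2. For i from 1 to m,

   a. Use backpropagation to compute $\nabla_{W^{(l)}} J(W, b; x, y)$ and
      $\nabla_{b^{(l)}} J(W, b; x, y)$.

   b. Set $\Delta W^{(l)} := \Delta W^{(l)} + \nabla_{W^{(l)}} J(W, b; x, y)$.

   c. Set $\Delta b^{(l)} := \Delta b^{(l)} + \nabla_{b^{(l)}} J(W, b; x, y)$.

3. Update the parameters:


$$W^{(l)} = W^{(l)} - \alpha \left[ \left( \frac{1}{m} \Delta W^{(l)} \right) + \lambda W^{(l)} \right]$$

$$b^{(l)} = b^{(l)} - \alpha \left[ \frac{1}{m} \Delta b^{(l)} \right]$$


We can now repeat the gradient descent steps to reduce our cost function
J(W,b).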
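
One iteration of this update can be sketched for a single layer as shown below. The per-example gradient routine is passed in as `grad_fn`; the name, its signature, and the simple `linear_unit_grad` stand-in used for the demonstration are hypothetical and take the place of a full backpropagation routine.

```python
def gradient_descent_step(W, b, examples, grad_fn, alpha, lam):
    """One batch update of a single layer's weights W (rows) and biases b."""
    m = len(examples)
    # Step 1: zero the accumulators.
    delta_W = [[0.0] * len(row) for row in W]
    delta_b = [0.0] * len(b)
    # Step 2: accumulate per-example gradients.
    for x, y in examples:
        grad_W, grad_b = grad_fn(W, b, x, y)
        for i, row in enumerate(grad_W):
            for j, g in enumerate(row):
                delta_W[i][j] += g
        for i, g in enumerate(grad_b):
            delta_b[i] += g
    # Step 3: averaged gradient plus weight decay on W only.
    new_W = [
        [w - alpha * (dw / m + lam * w) for w, dw in zip(row, d_row)]
        for row, d_row in zip(W, delta_W)
    ]
    new_b = [bi - alpha * (db / m) for bi, db in zip(b, delta_b)]
    return new_W, new_b

def linear_unit_grad(W, b, x, y):
    """Hypothetical stand-in for backpropagation: 0.5 * (w.x + b - y)^2."""
    err = sum(w * v for w, v in zip(W[0], x)) + b[0] - y[0]
    return [[err * v for v in x]], [err]

W_new, b_new = gradient_descent_step(
    [[0.0]], [0.0], [([1.0], [1.0])], linear_unit_grad, alpha=0.1, lam=0.0
)
```

In the demonstration the initial prediction is 0 and the target is 1, so both the weight and the bias move from 0.0 to 0.1 after one step with a learning rate of 0.1.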


Implementation and Experimental Results 
Our implementation of the neural network and Blind Source Separation,
and their respective results.


Implementation of Blind Source Separation: 


When implementing the Blind Source Separation algorithm, we chose the
following parameters.

Parameter                     Value
Sampling Rate                 16 kHz
Length of STFT                912
Number of FFT Points          4096
Overlap                       492
Number of Inputs              2
Permutation Reference Range   5

Input Parameters Into the System


Result of our Blind Source Separation: 


[Figure 2: Mixed Signal Observed at 1st Microphone. Observed Signal One
Sampled at 16 kHz; amplitude versus time.]

[Figure 3: Mixed Signal Observed at 2nd Microphone. Observed Signal Two
Sampled at 16 kHz; amplitude versus time.]

[Figure 4: Estimated Source Signal One, human speech; amplitude.]

[Figure 5: Estimated Source Signal Two, instrumental music; amplitude.]


Figure 2 and Figure 3 show the mixed signals observed at the 1st
microphone and the 2nd microphone respectively. The mixed signal is a
recording of a person counting from zero to ten with instrumental music
playing in the background. The intermittent spikes in the mixed signal
represent the person counting, while the continuous, small-amplitude signal
represents the instrumental music.


After applying the Blind Source Separation algorithm, we obtain two 
independent signals shown in Figure 4 and Figure 5 respectively. These two 
signals are very distinct from each other. The output signal 1 shown in 
Figure 4 consists mainly of the person counting whereas the output signal 2 
shown in Figure 5 is primarily the instrumental music. 


Implementation of Neural Network: 


Since we need our neural network to perform binary classification, we used 
512 nodes for the input layer, 25 nodes for the hidden layer, and 2 nodes for 
the output layer. 


We used 10,000 positive training examples and 10,000 negative training
examples to train our neural network. The positive training examples
mainly consist of audio files containing human speech, such as news
broadcasts. The negative training examples mainly consist of non-speech
audio, such as instrumental music.


After taking a 512-point Short-Time Fourier Transform, we place our
training sets into a 20,000-by-512 matrix where each row represents a single
training example. Then we used the backpropagation and gradient descent
algorithms described in the section above to train our neural network. We
used 70% of the examples for training, 15% for validation, and the
remaining 15% for testing.
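
The 70/15/15 division of the 20,000 rows can be sketched as follows. The shuffling, the fixed seed, and the function name are assumptions for illustration; the text does not state how the examples were assigned to each subset.

```python
import random

def split_indices(n_examples, train_frac=0.70, val_frac=0.15, seed=0):
    """Shuffle row indices and split them into train/validation/test sets."""
    indices = list(range(n_examples))
    # Assumption: rows are assigned at random, with a seed for repeatability.
    random.Random(seed).shuffle(indices)
    n_train = round(n_examples * train_frac)
    n_val = round(n_examples * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train_rows, val_rows, test_rows = split_indices(20000)
```

For 20,000 examples this yields 14,000 training rows, 3,000 validation rows, and 3,000 test rows.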


Result of Our Training of Neural Network: 


After employing backpropagation algorithm and gradient descent algorithm 
to train our neural network, we get the following test results: 


[Figure: Accuracy of Neural Network. Four confusion matrices (Training,
Validation, Test, and All), each plotting Output Class against Target Class
for classes 1 and 2.]


As our neural network is a binary classification model, our output is a
vector containing two entries. The first entry represents the probability
that the input signal is human speech, and the second entry represents the
probability that the input signal is non-human speech. The two entries
sum to one. The overall performance of our neural network is shown in the
All Confusion Matrix. The red block showing 0.2% means that 0.2% of the
examples are class 1 examples (human speech) that have been incorrectly
identified as class 2 (non-human speech). Similarly, the red block showing
1.2% means that 1.2% of the examples are class 2 examples (non-human
speech) that have been incorrectly identified as class 1 (human speech).
The blue block shows that our neural network is able to distinguish
between human speech and non-human speech with an accuracy of 98.6%.
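
The reported accuracy follows directly from the two error blocks, assuming their percentages are expressed relative to all examples (which is what makes them add up to the reported figure). A short check, with a hypothetical helper name:

```python
def accuracy_percent(error_cells_percent):
    """Overall accuracy given the off-diagonal cells as percentages of all examples."""
    return 100.0 - sum(error_cells_percent)

overall = accuracy_percent([0.2, 1.2])
```

The two misclassification blocks total 1.4%, leaving 98.6% of the examples correctly classified.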


Conclusion and Next Steps 
This is our conclusion 


Conclusion 


From our experimental results, we can see that our Blind Source Separation 
algorithm is able to separate the two independent components of the mixed 
signal very well. After the separation, the neural network can distinguish 
between human speech and non-human sound with an accuracy of 98.6%. 
Consequently, our entire system can successfully output the human speech 
from a linear mixture of sound signals. 


Next Steps 


Implement the system in real time. Our design currently works in the
MATLAB environment. It is implemented to deal with audio that is already
recorded, but it does not deal with real-time audio streams. We eventually
would like the system to work in real time, separating the sources,
forwarding speech, if any, and suppressing other signals instantly so that
people can benefit from this system. This implies improving the efficiency
of the algorithms and making use of MATLAB Simulink to explore the
possibility of a real-time implementation of our system.


Additionally, we would like to explore the possibility of separating several
sources in addition to human speech. As mentioned in the introduction,
other signals from the surroundings might also be important, such as a car
horn. We can improve the separation algorithm and train the neural network
so that the system can recognize not only human speech, but also other
important signals from the environment. With this implementation, users
would have the option to forward the signals they want. In order to achieve
that, one needs to increase the number of microphones to increase the
number of observed signals, and re-train the neural network so that it
recognizes different types of sound signals.


Rice University Department of Electrical and Computer Engineering 


Objective 


Explore ways to build a selective transparent 
headphone that propagates speech signal and 
attenuates all other types of signals. 


Project Overview 


• Since our objective is to blindly separate an input signal into its
principal independent components, we will use independent
component analysis to separate the signal into its components.

• Then, we will utilize a neural network to select which output signal, if
any, contains human speech, and forward it to the output.


Independent Component Analysis 


• y = input signals (sound mixtures)
• s = original sources
• ICA attempts to find the independent sources (individual sounds)
that comprise an input signal by solving the following formulation
for A:


$$y = As$$

• A is the mixing matrix used to combine the original sources, s, to
obtain the observed input signals, y.


Blind Source Separation (BSS) 

• Transform M microphone inputs into the frequency domain by taking
the short-time Fourier transform (STFT).

$$x(\omega, t) = [X_1(\omega, t), \ldots, X_M(\omega, t)]^T$$

• Find subspace filter to reduce reflections and ambient sound.

$$W(\omega) = \Lambda^{-1/2} E^H$$

• Find ICA filter matrix, $U(\omega)$, for all sampled frequencies.
• Separation filter: $B(\omega) = U(\omega) W(\omega)$
• Scaling filter: $B_m^{+}(\omega) = \mathrm{diag}[B^{+}_{m,1}, \ldots, B^{+}_{m,M}]$
• Permutation filter: $P = \arg\max_P F(P)$
• Final filter: $G(\omega) = P(\omega) B_m^{+}(\omega) B(\omega)$


For questions, please contact: 


Zichao Wanoince edu: Tianvi Yao@rice edie Stephen Xiai@rice edu; Sher 


• We used a three-layer neural network to distinguish between
human speech and background noise.

• More specifically, our neural network consists of an input layer, a
hidden layer, and an output layer and uses the backpropagation
algorithm and gradient descent to minimize the cost function.